Abstract

C has been proposed as a portable target language for implementing higher-order languages.
Previous efforts at compiling such languages to C have produced efficient code, but have had
to compromise on either the portability or the preservation of the tail-recursive properties of
the languages. We assert that neither of these compromises is necessary for the generation
of efficient code. We offer a Standard ML to C compiler, which does not make either of
these compromises, as an existence proof. The generated code achieves an execution speed
that is just a factor of two slower than the best native code compiler. In this paper, we
describe the design, implementation and the performance of this compiler.
1 Introduction



Implementors of new programming languages are faced with a dilemma: whether to sacrifice
efficiency for portability or the other way around. One approach that has been successfully
used in the past to avoid writing a code generator for each architecture is to compile to C,
using C as a universal intermediate language. C makes an efficient intermediate language
because it is relatively close to assembly language, yet mostly machine independent, and
compilers for C are available on most machines.

Bartlett [7] has shown that it is possible to compile Scheme programs into efficient C
programs. In his approach, constructs in Scheme are mapped to apparently similar constructs
in C. Unfortunately, this conceptually simple approach leads to several compromises. First,
the Scheme implementation fails to reflect the pragmatics of the language: proper
tail-recursion is lost, even though a compile-time analysis is used to recover some tail-recursion
information. Second, features such as first-class continuations (call/cc) and garbage
collection require the use of assembly language. The whole point of compiling to C, though, is to
avoid such machine dependencies.



Thus, Bartlett's implementation raises several questions. How can Scheme-like languages
be compiled without any use of assembly language? Can tail-recursive languages such as
Scheme be compiled to C and retain the property of proper tail-recursion? Can this be done
without an unacceptable loss of efficiency?

We answer these questions by describing our experience with compiling
Standard ML [12], a language which has a dynamic semantics similar to that of Scheme, to
C. We have been able to successfully compile all of Standard ML into efficient and portable
C code which runs on 32-bit architectures. 1 In doing so, we have completely avoided the use
of assembly language. In addition, our code generator handles extensions to Standard ML
such as call/cc and asynchronous signal handling. Our code is typically about two times slower
than the code produced by the best available Standard ML compiler, which compiles to
native machine code. Our implementation runs on Sun-3s, Sparcstations, Decstations and
an 80486-based machine. To the best of our knowledge, no full implementation of Standard ML is
available for the 80x86-based machines.



We discuss the basic approach we used for compiling ML to C and our reasons for choosing
this approach. We then discuss some limitations of the approach and the optimizations we
used to significantly improve the performance of the generated C code. We present a series
of benchmarks to support our claim of having achieved good performance relative to the
best available native code compiler. Finally, we present our conclusions.



2 Background 

2.1 Related Work 



Besides Bartlett's work, other related work includes the portable Cedar project [6]. In
this project, the existing compiler for Cedar was successfully retargeted to C in order to port
the implementation. They report achieving very good code performance for their
implementations on the SPARC and Motorola 68020.

1 The runtime system, however, currently requires some version of the Unix 4.3 BSD operating system.

Our work differs significantly from this work in several aspects. Cedar does not have
the requirement of proper tail-recursion and does not have functions as first-class values. 
Their implementation used highly machine and C compiler dependent code to traverse 
the C call stack to implement exception handling. Our implementation does not use any 
machine-dependent code and is completely portable across a wide class of machines. 

2.2 Standard ML 

Standard ML (SML) is a modern programming language embodying many innovations
in programming language design. It is a lexically-scoped, mostly-functional programming
language; functions are first-class values. It is statically-typed but, unlike other
statically-typed languages (Pascal, Ada), the types are automatically inferred by the compiler and
the type system is polymorphic. It has a sophisticated modules mechanism for developing
large programs which typechecks the interface between modules, as in Ada or Modula. It
provides a type-safe dynamically-scoped exception mechanism to handle unusual or deviant
conditions. It provides garbage collection, like most other languages of its type. It also
provides complete runtime safety; in particular, programs never "dump core".

As can be seen from the previous paragraph, there is considerable semantic distance
between Standard ML and C. Features like dynamically-scoped exceptions, garbage col- 
lection, complete run-time safety, and higher-order functions all pose problems for a C 
implementation of SML. 



3 Compilation Strategy 

Languages like Standard ML and Scheme can be regarded as syntactically sugared versions
of the λ-calculus. The general strategy for compiling such languages is to unsugar programs
to a simple call-by-value λ-calculus augmented with a branching operation, record
operations, and a set of primitive operators. In the case of Standard ML, the conversion
also removes the type information. This reduces the problem of compiling Standard ML to
a more manageable size, since this λ-calculus is much smaller.

There are two basic approaches to compiling the λ-calculus. The first approach is to
use continuation-passing style (CPS). This approach has been used successfully in Rabbit,
Orbit and the Standard ML of New Jersey compiler [15, 11, 3]. CPS uses a λ-calculus
based intermediate representation meeting the invariants that function applications never
be nested and that function calls always be tail-recursive. Since function calls are always
tail-recursive, the first call to return is also the only call to return. Thus, a function call is
transformed into a goto with arguments. Any program in the λ-calculus language can be
converted into CPS by a simple O(n) transformation [9].
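
To make the CPS invariants concrete, the following is a minimal sketch in C, not taken from sml2c. The names cont, fact_direct, fact_cps and fact are hypothetical. It contrasts a direct-style factorial, where work remains after the recursive call returns, with a CPS version in which every call is in tail position and the pending work is carried by an explicit continuation. For brevity the continuation records live on the C stack here; the compiler described in this paper allocates such records on its own heap.

```c
#include <assert.h>

/* Hypothetical continuation record: a code pointer, one captured value,
   and the continuation to invoke after this one. */
typedef struct cont {
    long (*code)(const struct cont *self, long result);
    long saved;
    const struct cont *next;
} cont;

/* Direct style: the multiplication happens after the recursive call returns. */
long fact_direct(long n) {
    return n == 0 ? 1 : n * fact_direct(n - 1);
}

/* Continuation body: multiply the incoming result by the saved n, pass it on. */
static long mul_then(const cont *self, long result) {
    return self->next->code(self->next, self->saved * result);
}

/* Final continuation: hand the answer back to the C caller. */
static long halt(const cont *self, long result) {
    (void)self;
    return result;
}

/* CPS: no call is nested inside another expression, and every call is the
   last action of the function that makes it. */
long fact_cps(long n, const cont *k) {
    if (n == 0)
        return k->code(k, 1);
    cont k2 = { mul_then, n, k };
    return fact_cps(n - 1, &k2);
}

long fact(long n) {
    assert(n >= 0);
    cont k0 = { halt, 0, 0 };
    return fact_cps(n, &k0);
}
```

Calling fact(5) walks down to n = 0 and then runs the chain of mul_then continuations. Note that C itself does not promise to turn these tail calls into jumps, which is exactly the difficulty the rest of the paper addresses.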

There are several advantages of using CPS as an intermediate representation [15, 10]. First, 
all intermediate values are explicitly named. Second, control-flow is explicitly available
via continuations. This makes it easy to implement exceptions and call/cc. Third, since all 
tail-calls turn into jumps, tail-recursion elimination is achieved for free. Fourth, the target 
machine for a CPS program is simple. It requires a set of registers, a heap on which one 
can allocate records, a jump instruction, and a small set of instructions which can be used 
to implement the CPS primitive operators [3]. 
The second approach to compiling the λ-calculus is to use a stack to hold activation records
for functions as the functions are evaluated. When the evaluation of an expression yields a
function, a closure is built. When this function is applied, the values for the lexically scoped
variables are fetched from the environment part of this closure.

The stack-based approach is more complicated. Recall that the exception mechanism of ML
is dynamically-scoped. Therefore, when an exception occurs the stack must be unwound
until an exception handler for the particular exception is found. Using a stack also makes it
difficult to preserve proper tail recursion. If the last thing a function f does is call another
function g, then the new activation record for g must replace the activation record for f on
the stack. Finally, the presence of a stack complicates the implementation of the call/cc
extension to Standard ML [8].

These complications become serious problems if we try to map ML constructs to apparently
similar C constructs. We can map ML functions to corresponding C functions by flattening
out all functions to a single lexical level. Closures can be represented by using a record
containing a C function pointer and the environment for the function. 2 This, however,
does not solve any of the problems mentioned above. First, one must implement some
cumbersome method for handling exceptions. Second, one has lost the property of proper
tail-recursion. It is possible to fix this for functions which are immediately tail-recursive (that
is, they call themselves). It is not obvious how to fix this for mutually recursive tail-calls
and tail-calls to functions that are not known until runtime, which are both quite frequent
in a programming style using higher-order functions. Finally, it is impossible to implement
the call/cc extension portably in C, since there is no portable way to save the stack of the
entire program, and then restore the stack at an arbitrary point in time. 3
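
The closure representation mentioned above, a record containing a C function pointer and an environment, might look roughly like the following. This is only an illustrative sketch: the names closure, add_code, make_adder and apply_closure are hypothetical, and a single captured integer stands in for an arbitrary environment.

```c
#include <assert.h>

/* Hypothetical closure layout: code pointer plus captured environment. */
typedef struct closure {
    int (*code)(const struct closure *self, int arg);
    int env;   /* one captured variable; real closures capture many */
} closure;

/* Flattened body of the ML function (fn y => x + y); x is read from env. */
static int add_code(const closure *self, int y) {
    return self->env + y;
}

/* Corresponds to evaluating (fn x => fn y => x + y) applied to x. */
closure make_adder(int x) {
    assert(add_code != 0);
    closure c = { add_code, x };
    return c;
}

/* Applying a closure: call its code pointer, passing the closure itself. */
int apply_closure(const closure *c, int arg) {
    return c->code(c, arg);
}
```

Each application here is an ordinary C call, so a chain of tail calls through apply_closure would still grow the C stack. This is the problem described in the paragraph above.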



4 Design of the Compiler 

Based on the analysis presented in the previous section, we chose to use the CPS approach
to code generation and to treat C as a target assembly language.

Appel and Jim [3] proposed a variant of CPS that they called continuation-passing
closure-passing style. This is a refinement of the approach to code generation taken in Orbit [10]. In
this variant, all functions have been flattened out to one lexical level and record operations
are used to explicitly represent closures.

An example of the transformation is shown in Figures 1 and 2. Figure 1 contains SML code
and Figure 2 contains the representation of the same code after the transformation.

The program in Figure 2 is presented in a pseudo-code style: we have not expanded infix
operators to their CPS form. Note that the converted code looks remarkably similar to
C code, except for the fact that all function calls are tail-recursive.

The continuation-passing, closure-passing style approach to code generation has been
implemented in the Standard ML of New Jersey (SML/NJ) compiler [4], a publicly available,
freely redistributable optimizing compiler developed at AT&T and Princeton University. [...]
It is organized into a series of modules which repeatedly transform the program into
various intermediate

2 Bartlett used a similar approach in his Scheme->C compiler.

3 longjmp and setjmp cannot do this!
fun foo x =
  let fun bar 0 = "bar"
        | bar x = baz (x-1)
      and baz 0 = "baz"
        | baz x = bar (x-1)
  in bar x
  end

Figure 1: SML code before transformation 

fun foo1 (x, c) = bar1 (x, c)
and bar1 (x, c) = if x=0 then c "bar" else baz1 (x-1, c)
and baz1 (x, c) = if x=0 then c "baz" else bar1 (x-1, c)

Figure 2: SML code after transformation 

languages, the end result being functions represented in the continuation-passing, closure- 
passing style. This makes it easy to replace modules at different stages in the compilation. 

We chose to reuse the SML/NJ modules that precede the code-generation stage, namely 
the front-end, CPS conversion and CPS optimization phases. This allowed us to focus our 
attention on the issues involved in translation to C. It also allowed us to directly compare the 
performance of our implementation with the best available native code generator, factoring 
out any differences that might arise due to different compilation strategies. In addition, this 
permitted us to reuse the well-designed runtime system of SML/NJ [2]. 

This decision reduced our problem to compiling the CPS representation of SML programs 
as generated by the SML/NJ compiler. The CPS representation is shown in Figure 3. 

Continuation expressions, or values of type cexp, represent programs. Most variants of 
the cexp type bind zero or more variables whose scope is another continuation expression. 
Variable names are represented by values of type lvar. In practice, each variable name is 
designated by a unique integer. 4 

The target machine required to efficiently implement this language is very simple. The 
machine configuration consists of: 

• A heap, i.e., a contiguous block of memory 

• A set of general purpose registers 

• Reserved registers for holding the following: 

- Heap pointer 

- Heap limit pointer 

- Current exception continuation 

- Arithmetic temporaries 

4 For more information about the CPS language, refer to [3]. 
datatype value = VAR of lvar
               | LABEL of lvar
               | INT of int
               | REAL of string
               | STRING of string

datatype cexp = RECORD of (value * accesspath) list * lvar * cexp
              | SELECT of int * value * lvar * cexp
              | OFFSET of int * value * lvar * cexp
              | APP of value * value list
              | FIX of (lvar * lvar list * cexp) list * cexp
              | SWITCH of value * cexp list
              | PRIMOP of primop * value list * lvar list * cexp list



Figure 3: CPS language used by Standard ML of New Jersey 

The instruction set of the target machine is close to that found on most conventional
machines. It includes load, store, logical operations, arithmetic operations with and without
overflow checking, floating-point operations, register-register move, conditional branches and
jumps. The instructions can be labeled.



5 Implementation 

It is straightforward to implement the target machine resources in C. Registers are
implemented using global variables and the heap is implemented by an array of integers.
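
As a rough illustration of this mapping, the sketch below declares the target machine resources as C globals and shows how allocating a record reduces to bumping the heap pointer. The identifiers (heap, R, heap_ptr, heap_limit, exn_cont, alloc_pair) and the heap size are assumptions made for the example; only the register count of 31 comes from the paper (footnote 6).

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_WORDS 1024          /* illustrative size, not from the paper */
#define NUM_REGS   31            /* register count given in footnote 6 */

long heap[HEAP_WORDS];           /* the heap: an array of integers */
long R[NUM_REGS];                /* general purpose registers */
long *heap_ptr   = heap;         /* reserved register: next free word */
long *heap_limit = heap + HEAP_WORDS;  /* reserved register: end of heap */
long exn_cont;                   /* reserved register: exception continuation */

/* Allocate a two-field record by bumping the heap pointer.
   Returns NULL where a real runtime would invoke the garbage collector. */
long *alloc_pair(long a, long b) {
    assert(heap_ptr >= heap && heap_ptr <= heap_limit);
    if (heap_limit - heap_ptr < 2)
        return NULL;
    long *rec = heap_ptr;
    rec[0] = a;
    rec[1] = b;
    heap_ptr += 2;
    return rec;
}
```

Because every one of these resources is a global, a C compiler will normally keep them in memory rather than in machine registers, which motivates the register caching optimization of Section 7.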

Most target machine instructions are also straightforward to implement in C. The only ones
that present problems are the jump instruction and arithmetic operations with overflow
checking. The implementation of the jump instruction is problematic because labels are not
first-class values in C. In particular, we cannot store a label in memory or jump to an
arbitrary point in the C program. The only way to get the address of a block of C code is
to encapsulate it in a C function. Since jumps are not supposed to return, we cannot simply
use function calls to implement jumps, for if we did, the stack would quickly overflow.
Instead, we use a technique from Rabbit which Steele called the UUO handler and which we
call the apply-like procedure: each function returns the address of the next function to be
executed, and the apply-like procedure calls it (Figure 4).

Implementing arithmetic instructions with overflow checking is difficult since most
implementations of C tend to ignore overflow. For Standard ML, however, overflow
int foo1 (), bar1 (), baz1 ();

int apply (start)
int (*start) ();
{ while (1) start = (int (*)()) (*start)(); }

int foo1 ()
{ return ((int) bar1); }

int bar1 ()
{ if (R1==0) { R1 = "bar"; return R2; }
  else { R1 = R1 - 1; return ((int) baz1); }
}

int baz1 ()
{ if (R1==0) { R1 = "baz"; return R2; }
  else { R1 = R1 - 1; return ((int) bar1); }
}



Figure 4: C code and apply-like procedure 

checking is mandated by the language definition. Given the usual ranges for integers 
on two’s complement machines, addition, subtraction, multiplication, and division can all 
overflow. To implement overflow checking, we had to add explicit checks which used 
bitwise operations on the operands and the result. 
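
One common way to write such explicit checks is sketched below. It is an assumption-laden illustration rather than the code sml2c emits: the function names are hypothetical, and the arithmetic is done on unsigned values so that the wraparound needed for the test is well defined in C. Addition overflows exactly when both operands have the same sign and the result has the other sign; subtraction overflows when the operands differ in sign and the result's sign differs from the first operand.

```c
#include <assert.h>
#include <limits.h>

#define SIGN_SHIFT (sizeof(unsigned int) * CHAR_BIT - 1)

/* Returns 1 if a + b overflows; otherwise stores the sum in *out, returns 0. */
int add_checked(int a, int b, int *out) {
    unsigned int ua = (unsigned int)a, ub = (unsigned int)b;
    unsigned int us = ua + ub;            /* wraps, never undefined */
    /* Sign bit set iff a and b agree in sign and the sum disagrees. */
    if (((ua ^ us) & (ub ^ us)) >> SIGN_SHIFT)
        return 1;
    assert(out != 0);
    *out = a + b;                         /* safe: shown not to overflow */
    return 0;
}

/* Returns 1 if a - b overflows; otherwise stores the difference, returns 0. */
int sub_checked(int a, int b, int *out) {
    unsigned int ua = (unsigned int)a, ub = (unsigned int)b;
    unsigned int ud = ua - ub;            /* wraps, never undefined */
    /* Sign bit set iff a and b differ in sign and the result differs from a. */
    if (((ua ^ ub) & (ua ^ ud)) >> SIGN_SHIFT)
        return 1;
    assert(out != 0);
    *out = a - b;                         /* safe: shown not to overflow */
    return 0;
}
```

Each check costs several bitwise operations plus a branch on top of the arithmetic itself, which is the overhead that the Opt-u configuration in Section 6 removes.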

For register allocation, we reused the algorithm used in SML/NJ. The algorithm is divided 
into the spilling phase and the register assignment phase. In the spilling phase, the CPS 
programs are rewritten before instruction selection so that no subexpression has more than 
n free variables, where n is related to the number of actual registers. 6 This guarantees that 
every variable, at any point in the program, can be placed in a register. Register assignment 
is then done on the fly using register-tracking. To assign a register to a variable, we calculate 
the set of registers that are already used by variables that are free in the body of the CPS
expression binding the variable. We then choose some unused register for the variable. The 
register assignment phase uses a variety of heuristics to try to minimize shuffling of registers 
at function calls. 

Figure 4 contains a simplified version of the C code that would be generated for the sample
SML code shown in Figure 1. The figure also shows the apply-like procedure stripped of
its initialization code.
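
For readers who want to run the scheme, the following is a sketch of the same idea in ISO C. It departs from the figure in ways that should be treated as assumptions: function addresses are carried as function pointers rather than as int, the string result is kept in a separate variable (R1_str) instead of sharing the integer register, and a hypothetical final continuation halt stops the loop by returning a null pointer. The helper run_foo is likewise only for demonstration.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*generic_fn)(void);        /* any code address */
typedef generic_fn (*code_fn)(void);     /* a block: returns the next block */

static long R1;                 /* argument register */
static const char *R1_str;      /* string result, kept separate for type safety */
static generic_fn R2;           /* continuation register */

static generic_fn foo1(void), bar1(void), baz1(void);

static generic_fn foo1(void) { return (generic_fn)bar1; }

static generic_fn bar1(void) {
    if (R1 == 0) { R1_str = "bar"; return R2; }
    R1 = R1 - 1;
    return (generic_fn)baz1;
}

static generic_fn baz1(void) {
    if (R1 == 0) { R1_str = "baz"; return R2; }
    R1 = R1 - 1;
    return (generic_fn)bar1;
}

/* Hypothetical final continuation: a null result stops the driver loop. */
static generic_fn halt(void) { return NULL; }

/* The apply-like procedure: call a block, then call whatever it returns. */
static void apply(code_fn start) {
    generic_fn next = (generic_fn)start;
    while (next != NULL)
        next = ((code_fn)next)();
}

/* Run foo on n and return the resulting string. */
const char *run_foo(long n) {
    assert(n >= 0);
    R1 = n;
    R2 = (generic_fn)halt;
    apply(foo1);
    return R1_str;
}
```

Since every block returns to apply before the next one is called, the C stack stays at a constant depth however many times bar1 and baz1 call each other.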



6 Benchmarks 

We benchmarked three versions of our system and the SML/NJ compiler on two different
classes of architectures. The three versions are unoptimized (Unopt), optimized (Opt-s) and
optimized with unsafe arithmetic (Opt-u). The unoptimized version is a straightforward
implementation of the design described in Section 5. The optimized version uses the
techniques presented in Section 7 to optimize the generated C code. The optimized version
with unsafe arithmetic gives an idea of the performance penalty imposed by the unavailability
of arithmetic operations with overflow checking in C.

6 In our case, we set the number of registers to 31. This number arises from the runtime system.

The benchmark suite is a combination of real programs in regular use and programs created 
to exercise particular aspects of implementations. The suite is: 

• insert: insertion sort of a pseudo-randomly generated list of 2000 integers.

• fib: typical doubly exponential version of Fibonacci used in introductory programming
classes.

• ml-yacc: Parser generator based on the Unix yacc utility running on the actual parser
specification for SML/NJ [16]. It is approximately 6500 lines of SML code.

• ml-lex: Lexer generator based on the Unix lex utility running on the actual lexer
specification for SML/NJ [5]. It is approximately 1200 lines of SML code.

The last two are part of the SML/NJ distribution and are in widespread use. For
example, they are used to generate the parser and the lexer for SML/NJ itself. We plan
to run more benchmarks in the near future. They include the sml2c compiler [...] program,
a benchmark to exercise [...] and integer matrix multiplication.

We ran the benchmarks on a Sun 3/140 with 16 MB of memory and a Decstation
5000 with [...] of memory. We used built-in timing functions available in SML/NJ
which are based on the Unix getrusage system call. We factored out the garbage collection costs
so that the numbers represent only code speed. Garbage collection costs are irrelevant for the
sake of comparison since all versions of code generated by sml2c and SML/NJ generate the
same amount of garbage and use the same garbage collector.

The C code was compiled on the Sun-3 using gcc [14] with the -O flag to enable optimization.
[...]

We found that our benchmark programs were about twice as slow as the native code programs
produced by SML/NJ. The running times of the real programs with safe arithmetic are
not significantly worse than those with unsafe arithmetic. This indicates, of course, that
these programs spend most of their time doing symbolic computation and very little time doing
arithmetic.

We also compared the sizes of our stand-alone executables with those generated by SML/NJ
for both the Sun-3 and the Decstation 5000. The executables were stripped of symbol table
information to obtain a better measure of the code size. Sizes of the executables for each
machine can be found in Tables 3 and 4. We found that the size of our optimized
executables with safe arithmetic was well within a factor of two of those
produced by SML/NJ.
Benchmarks   SML/NJ (α)   Opt-u (β)   β/α    Opt-s (δ)   δ/α    Unopt (γ)   γ/α
insert       10.8s        20.8s       1.92   19.9s       1.84   37.5s       3.47
fib          20.0s        40.0s       2.00   50.6s       2.53   110.0s      5.50
ml-yacc      45.6s        88.2s       1.93   89.2s       1.95   141.8s      3.10
ml-lex       163.4s       276.8s      1.69   273.7s      1.67   464.7s      2.83

Table 1: Benchmarks for Sun 3/140 



Benchmarks   SML/NJ (α)   Opt-u (β)   β/α    Opt-s (δ)   δ/α    Unopt (γ)   γ/α
insert       2.14s        3.39s       1.58   3.36s       1.57   5.80s       2.71
fib          2.73s        6.82s       2.50   7.94s       2.91   11.15s      4.08
ml-yacc      4.94s        9.20s       1.86   9.80s       1.98   16.85s      3.41
ml-lex       15.43s       26.95s      1.74   26.81s      1.74   50.36s      3.26

Table 2: Benchmarks for Decstation 5000



Benchmarks   SML/NJ (α)   Opt-u (β)   β/α    Opt-s (δ)   δ/α    Unopt (γ)   γ/α
insert       197K         279K        1.41   279K        1.41   418K        2.12
fib          172K         279K        1.62   279K        1.62   418K        2.43
ml-yacc      409K         631K        1.52   631K        1.52   983K        2.40
ml-lex       229K         377K        1.64   385K        1.68   582K        [...]

Table 3: Size of executables for Sun 3/140 



Benchmarks   SML/NJ (α)   Opt-u (β)   β/α    Opt-s (δ)   δ/α    Unopt (γ)   γ/α
insert       295K         332K        1.12   335K        1.13   507K        1.72
fib          [...]        332K        1.21   [...]       1.22   [...]       1.85
ml-yacc      610K         844K        1.38   [...]       1.40   1348K       2.21
ml-lex       360K         479K        1.33   491K        1.36   745K        2.07

Table 4: Size of executables for Decstation 5000 
7 Optimizations



There are a number of problems with the implementation described in Section 5. The
generated C code fails to make effective use of registers, function calls are expensive, and
arithmetic is expensive.

Recall that the target machine registers are implemented using global variables. Most C
compilers will not move global variables into real registers. This is true for a number of
reasons. First, they cannot be sure whether any write via a pointer can possibly affect global
variables since they compile only on a file-by-file basis. Some other separately compiled
file may capture the address of a global variable and then pass it on to code in another file.
Second, moving global variables into registers affects the semantics of signals. If a global
variable is moved to a register for the duration of the execution of a function, it cannot be
updated by any signal handler during the execution of that function. Since global variables are
the only way that signal handlers can communicate with normally executing code, this
is problematic.

Function calls are expensive since every function call involves a return to the apply-like
procedure and then a call by the apply-like procedure. This involves two jumps. The
expense is usually more than this, since many C compilers implement the return by a jump
to a piece of code at the end of a function which cleans things up and then does the actual
return back to the apply-like procedure. In addition, if we did have values in registers, we
would incur some expense to save register values when entering a function and to restore
register values when exiting the function. This actually occurs in practice on callee-save
machines, even though the apply-like procedure has no variables live across function calls.
Thus, the cost of a function call is at least two jumps and some indeterminate number
of loads and stores.

Finally, integer arithmetic is expensive because most C compilers provide no way to use
the overflow detection provided by the hardware. Thus, we incur the expense of doing explicit
software overflow checking. Addition and subtraction require complicated explicit bitwise
tests and a function call is needed for multiplication. For division, some comparisons and
branches suffice.

We designed and implemented a series of optimizations to deal with these problems. We
present them in the following subsections.



7.1 Register Caching 

To make effective use of real registers, we cache target machine registers in local variables
for the duration of function calls. Most C compilers attempt to place local variables in
registers when possible. This caching is done when it is worthwhile.

Our register caching optimization uses a simple static count of the number of uses of a
[...] We handle [...] by never [...]
register and spilling the heap pointer register to the corresponding global variable before
executing any instruction that may cause a machine fault. Fortunately, the only instructions
which may cause machine faults in our implementation are division by zero and floating
point operations, so the need to spill is fairly rare.

The register caching optimization could be improved, especially for small functions, by 
doing some flow analysis to detect loops. In addition, we should only count uses along each 
possible path through a function body, not the entire function body. We could also do some 
peephole optimization to eliminate situations where we first write to a local variable, and 
then write the local variable to a global variable. C compilers will typically not optimize 
this, for the same reasons that they will not move global variables into registers.

7.2 Function integration

A known function is a function whose call sites are all known. If all call paths to a known
function f pass through a single function g, we can integrate f into the body of g. This means
that f gets placed in the body of g and that all calls to f turn into gotos. We also perform
direct tail-recursion elimination, turning all calls to g within its body into gotos. Function
integration is useful for avoiding passes through the apply-like function. In particular, it is
useful for optimizing tight tail-recursive loops.

Computing whether all call paths to a known function pass through another function can 
be cast as the classical problem of computing dominators in a call graph with multiple start
nodes, where each unknown function is a start node. An algorithm for computing dominator
information can be found in [1]. A maximal dominator is defined to be a function which
dominates itself. This captures the intuitive concept of the highest-level dominator. After 
computing the set of dominators for each known function, we can integrate the known 
function into its maximal dominator provided that its maximal dominator is not itself. 
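
The dominator computation can be sketched with the textbook iterative set-intersection method. The code below is an illustration under stated assumptions and is not the algorithm of [1] as used in sml2c: functions are numbered 0 to n-1, sets are bit masks in a single unsigned word (assumed to hold at least 32 bits), and the names compute_dominators, calls and starts are hypothetical.

```c
#include <assert.h>

#define MAX_FUNCS 32   /* illustrative limit so that a set fits in one word */

/* calls[i] has bit j set when function i calls function j.
   starts has bit i set when function i is unknown (a start node).
   On return, dom[i] is the set of functions that dominate function i. */
void compute_dominators(int n, const unsigned calls[], unsigned starts,
                        unsigned dom[]) {
    assert(n > 0 && n <= MAX_FUNCS);
    unsigned all = (n == MAX_FUNCS) ? ~0u : ((1u << n) - 1u);

    /* Start nodes are dominated only by themselves; others start at "all". */
    for (int i = 0; i < n; i++)
        dom[i] = ((starts >> i) & 1u) ? (1u << i) : all;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {
            if ((starts >> i) & 1u)
                continue;
            /* Intersect the dominator sets of every caller of i. */
            unsigned meet = all;
            for (int p = 0; p < n; p++)
                if ((calls[p] >> i) & 1u)
                    meet &= dom[p];
            unsigned next = meet | (1u << i);
            if (next != dom[i]) {
                dom[i] = next;
                changed = 1;
            }
        }
    }
}
```

For the program of Figure 1, numbering foo, bar and baz as 0, 1 and 2 with foo as the only start node, the result says that foo dominates both bar and baz, so both can be integrated into foo.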

Consider the SML code in Figure 1. The functions bar and baz are known and foo is the
maximal dominator for them. Figure 2 shows the pseudo-CPS code that represents these
functions as they are presented to the code generator. Figure 5 shows a simplified version 
of C code generated for these functions with function integration and register caching. Note 
that the mutually tail-recursive functions have been compiled into a tight loop with the loop 
counter placed in a register. 7 
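
A compilable sketch of the combined effect of function integration and register caching on this example follows. It simplifies the generated code in ways that are assumptions of the sketch: the string result goes to a separate variable R1_str, and the function simply returns instead of returning the continuation held in R2. The name foo1_integrated is hypothetical.

```c
#include <assert.h>

long R1;              /* global "register" holding the argument */
const char *R1_str;   /* hypothetical separate slot for the string result */

/* foo with bar and baz integrated: calls between them become gotos. */
void foo1_integrated(void) {
    assert(R1 >= 0);
    register long r1 = R1;   /* cache the global register in a local */
    goto bar1;
bar1:
    if (r1 == 0) { R1_str = "bar"; R1 = r1; return; }
    r1 = r1 - 1;
    goto baz1;
baz1:
    if (r1 == 0) { R1_str = "baz"; R1 = r1; return; }
    r1 = r1 - 1;
    goto bar1;
}
```

The global R1 is read once on entry and written back once on exit; inside the loop only the local r1 is touched, so a C compiler is free to keep the counter in a machine register.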

7.3 Modified overflow checks

We can simplify the overflow checks by constant-folding them in the case where one operand 
is known at compile time. For example, the overflow check for addition can be reduced to a 
single comparison and branch from a complicated boolean expression involving 5 operators 
and the procedure call for multiplication can be replaced with two compares and a branch. 
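
The folded checks can be illustrated as follows. This is a sketch assuming the constant operand k and its sign are known at compile time; the function names are hypothetical, and the bounds INT_MAX - k and INT_MAX / k would themselves be folded to constants by the C compiler when k is a literal.

```c
#include <assert.h>
#include <limits.h>

/* x + k for a known positive k: one comparison suffices. */
int add_pos_const_overflows(int x, int k) {
    assert(k > 0);
    return x > INT_MAX - k;
}

/* x + k for a known negative k: one comparison suffices. */
int add_neg_const_overflows(int x, int k) {
    assert(k < 0);
    return x < INT_MIN - k;
}

/* x * k for a known positive k: two comparisons, no procedure call. */
int mul_pos_const_overflows(int x, int k) {
    assert(k > 0);
    return x > INT_MAX / k || x < INT_MIN / k;
}
```

Compared with the general check, which must inspect the signs of both operands and of the result, each of these tests compares x against a single precomputed bound.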

7.4 Benchmarks 

To show the benefits of the optimizations, we compiled and timed our benchmark programs, 
enabling only one optimization at a time. We measured the performance of the register 
caching (reg), function integration (integ), and arithmetic optimizations (arith). We also 
measured the performance of the tail-recursion elimination optimization without function 

7 The redundant gotos appearing in the code are optimized away by most C compilers.
int foo1 ();
int foo1 ()
{ register int r1;
  r1=R1;
  goto bar1;
bar1:
  if (r1==0) { r1 = "bar"; R1=r1; return ((int) R2); }
  else { r1 = r1 - 1; goto baz1; }
baz1:
  if (r1==0) { r1 = "baz"; R1=r1; return ((int) R2); }
  else { r1 = r1 - 1; goto bar1; }
}



Figure 5: Optimized C code 



Benchmarks   unopt    reg      integ    tail     arith    opt
insert       37.5s    28.6s    32.3s    30.1s    34.7s    19.9s
fib          110.0s   76.7s    92.6s    101.3s   86.3s    50.6s
ml-yacc      141.8s   118.5s   131.4s   138.2s   134.8s   89.2s
ml-lex       464.7s   419.2s   432.5s   462.0s   469.7s   273.7s

Table 5: Benchmarks for optimizations (Sun 3/140)



integration (tail), to see whether function integration is an improvement over tail-recursion
elimination. For the purposes of comparison, we compiled the benchmark programs with no
optimizations (unopt) and with all optimizations and safe arithmetic (opt). The benchmark
results are shown in Table 5. [...]
compiled by gcc with the -O option.

To demonstrate whether our optimizations enhance or interfere with each other, we have
listed the sum of the speed-ups over unoptimized code for each individual optimization 
versus the speed-up for all optimizations over unoptimized code in Table 6. 

We see that in all cases, except fibonacci, the speed-up obtained with all optimizations is
greater than the sum of the speed-ups for individual optimizations. This is to be expected from
the interaction between function integration and register caching. With function integration,
we need to spill values cached in local variables to global variables less frequently, especially
for tight loops. We attribute the dramatic speed-up in the lexer generator to this interaction.
We speculate that the interference shown in fibonacci is due to the poor performance of the



Benchmarks   individual speed-ups   speed-up (all optimizations)
insert       16.9s                  17.6s
fib          74.4s                  59.4s
ml-yacc      40.7s                  52.7s
ml-lex       72.7s                  191.0s



Table 6: Sum of individual speed-ups versus speed-up for all optimizations at once 
static textual count heuristic (for register caching) when the function integration optimization
is enabled. 
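
The entries of Table 6 can be recomputed from the Sun 3/140 timings of Table 5. The sketch below does so in tenths of a second to stay in exact integer arithmetic. It assumes, as the published sums indicate, that the individual speed-ups counted are those of the reg, integ and arith columns and that the tail column is excluded; the type and function names are hypothetical.

```c
#include <assert.h>

/* Times from Table 5 in tenths of a second: unopt, reg, integ, arith, opt. */
typedef struct {
    const char *name;
    int unopt, reg, integ, arith, opt;
} row;

const row table5[4] = {
    { "insert",   375,  286,  323,  347,  199 },
    { "fib",     1100,  767,  926,  863,  506 },
    { "ml-yacc", 1418, 1185, 1314, 1348,  892 },
    { "ml-lex",  4647, 4192, 4325, 4697, 2737 },
};

/* Sum of the speed-ups of each optimization applied on its own. */
int sum_individual(const row *r) {
    assert(r != 0);
    return (r->unopt - r->reg) + (r->unopt - r->integ) + (r->unopt - r->arith);
}

/* Speed-up with all optimizations enabled at once. */
int combined(const row *r) {
    assert(r != 0);
    return r->unopt - r->opt;
}
```

For insert this gives 16.9s against 17.6s, and for fib 74.4s against 59.4s, matching Table 6. For ml-yacc the combined figure comes out as 52.6s, one tenth of a second below the tabulated 52.7s, presumably a rounding difference in the original measurements. The negative arith contribution for ml-lex (464.7s to 469.7s) is included in its sum of 72.7s.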

8 Conclusions and Future Work 

We can now answer the questions that were raised at the beginning of this paper. Yes, it 
is possible to compile Scheme-like languages without using any assembly language. And, 
yes, it is possible to develop portable implementations of such languages without sacrificing 
either proper tail-recursion or efficiency. We offer our system, sml2c, as an existence proof. 

We would like to point out that even though C is quite close to assembly language, there 
is sufficient semantic distance between the two to create significant obstacles to using C as 
a target language. The biggest problems are the lack of first-class labels and the arbitrary 
limits placed on input programs by most C compilers. 8 

We have several potential optimizations that we plan to evaluate and possibly incorporate 
into the compiler. We also plan to build a tool to generate a detailed trace of program 
execution at CPS level. We hope to use this tool to perform quantitative analyses of 
proposed and implemented optimizations. 